Fetching data

We first fetch state-level JHU CSSE confirmed cases and deaths using the API from the COVIDcast project. Here and throughout we look at the 7-day smoothed versions of these signals (via 7-day trailing averages).

library(covidcast) # https://github.com/cmu-delphi/covidcast
library(dplyr)
library(purrr)
library(tidyr)
library(ggplot2) 
theme_set(theme_bw())

start_day = "2020-03-01"
as_of = "2020-11-16" # Fetch data "as of" the day we first did this analysis!

cases = covidcast_signal(data_source = "jhu-csse",
                         signal = "confirmed_7dav_incidence_num", 
                         start_day = start_day, geo_type = "state",
                         as_of = as_of)

deaths = covidcast_signal(data_source = "jhu-csse",
                          signal = "deaths_7dav_incidence_num", 
                          start_day = start_day, geo_type = "state",
                         as_of = as_of)

Lagged correlations

Now we compute correlations for between national cases and deaths, for a bunch of different lags for the cases signal. We restrict our attention to data after April 1.

Min = function(x) min(x, na.rm = TRUE)
Max = function(x) max(x, na.rm = TRUE)
Sum = function(x) sum(x, na.rm = TRUE)
Mean = function(x) mean(x, na.rm = TRUE)
Median = function(x) median(x, na.rm = TRUE)

compute_cor = function(x, y, lags = 0:40, start_day = NULL, end_day = NULL, 
                       ...) {
  if (is.null(start_day)) start_day = Min(x$time_value)
  if (is.null(end_day)) end_day = Max(x$time_value)
  start_day = as.Date(start_day); end_day = as.Date(end_day)
  x = x %>% filter(between(time_value, start_day, end_day))
  y = y %>% filter(between(time_value, start_day, end_day))
  
  x = map_dfr(lags, function(lag) {
    covidcast_cor(x, y, dt_x = -lag, ...) %>% mutate(lag = lag)
  })
  
  attributes(x)$start_day = start_day
  attributes(x)$end_day = end_day
  return(x)
}

df = compute_cor(cases, deaths, start_day = "2020-04-01")

Note that we now have one correlation per state and lag value. Now for each lag, we try aggregating them (over states) in three different ways: median, mean, and population-weighted mean.

# Summarize by median, mean, population-weighted mean
df_summary = state_census %>% 
  select(POPESTIMATE2019, ABBR) %>%
  rename("pop" = POPESTIMATE2019, "geo_value" = ABBR) %>%
  mutate(geo_value = tolower(geo_value)) %>%
  inner_join(df, by = "geo_value") %>%
  group_by(lag) %>% 
  summarize(Median = Median(value),
            Mean = Mean(value), 
            WeightedMean = Sum(value * pop) / Sum(pop)) %>%
  pivot_longer(cols = c("Median", "Mean", "WeightedMean"),
               names_to = "metric", values_to = "value") 

# Now compute the maximizing lag for each metric
df_argmax = df_summary %>% 
  group_by(metric) %>%
  filter(value == Max(value))

title = "Correlation between lagged cases and deaths"
subtitle = sprintf("State-level, from %s to %s", 
                   format.Date(attributes(df)$start_day, "%B %d"), 
                   format.Date(attributes(df)$end_day, "%B %d"))

ggplot(df_summary, aes(x = lag, y = value)) +
  geom_point(aes(color = metric)) + 
  geom_line(aes(color = metric)) +
  geom_vline(data = df_argmax,
             aes(xintercept = lag, color = metric),
             size = 0.75, show.legend = FALSE) + 
  labs(x = "Lag (days)", y = "Correlation", title = title, 
       subtitle = subtitle) +
  theme(legend.position = "bottom", legend.title = element_blank())

The median and mean curves are quite different; suggesting that the distribution of correlations per lag (over states) is skewed, and also likely also has a big spread given the gap between the median and mean. The mean curve is maximized at 19 days, but has very little curvature around its maximum. The median curve is maximized at 16 days, but has a second local max 27 days. From this picture, it seems difficult to look at any one lag as the “right” one.

Below we plot each correlation per lag value (over states) and overlay the median and mean curves on top. Since the population-weighting didn’t really do much to affect the mean curve, we exclude it.

ggplot(df, aes(x = lag, y = value)) +
  geom_point(alpha = 0.25) +
  geom_line(data = df_summary %>% 
              filter(metric != "WeightedMean"),
            aes(x = lag, y = value, color = metric), 
            size = 1.25) +  
  labs(x = "Lag (days)", y = "Correlation", title = title, 
       subtitle = subtitle) +
  geom_vline(data = df_argmax %>% 
               filter(metric != "WeightedMean"),
             aes(xintercept = lag, color = metric),
             size = 0.75, show.legend = FALSE) +
  theme(legend.position = "bottom", legend.title = element_blank())

As we guessed, there’s both a lot of asymmetry and a huge spread. So it seems pretty difficult to agree on 16 vs 19 vs 27 as the “right” lag with any amount of confidence.

Next we look at the distribution of maximizing lags per state.

# Compute the maximizing lag per state
df_argmax_by_state = df %>% 
  group_by(geo_value) %>%
  filter(value == Max(value)) %>%
  ungroup() %>%
  group_by(lag) %>%
  mutate(y = 1:n()) %>%
  ungroup()

ggplot(df_argmax_by_state, aes(x = lag, y = y, label = geo_value)) + 
  geom_point(size = 3, alpha = 0.25) + geom_text() +
  labs(x = "Lag (days)", y = "Count", title = "Maximizing lags by state", 
       subtitle = subtitle)

Several states have maximizing lags somewhere around 20, but there’s still a huge spread. Once again, it’s hard to walk away with a clear picture for the “right” lag.

CFR estimation: national level

We compute the case fatality ratio (CFR), for various lags. For each lag value \(k\), we actually shift the death signals forward in time \(k\) days, then take a ratio to current cases. Thus the interpretion of the (say) CFR using a lag of 10 estimated on June 1 is as follows: of the cases that arrive June 1, it gives the fraction that will die on June 11.

# Add national signals by summing up cases and deaths
append_nat = function(x) {
  x %>% 
    select(data_source, signal, time_value, geo_value, value) %>%
    bind_rows(x %>% 
                group_by(data_source, signal, time_value) %>%
                summarize(geo_value = "us", value = Sum(value))) %>%
    mutate(issue = time_value, lag = NA, stderr = NA, sample_size = NA)
}

cases = append_nat(cases)
deaths = append_nat(deaths)

# Aggregate cases and deaths with dt = 10:30 
signals = aggregate_signals(list(cases, deaths), dt = list(0, 10:30))

# Divide each of the death signals by cases
scale_by = function(x, by, prefix = "value") {
  x %>% 
    mutate(across(starts_with(prefix), ~ 100 * .x / !!as.symbol(by))) %>%
    select(-!!as.symbol(by)) %>%
    pivot_longer(cols = starts_with(prefix),
                 names_to = "name", values_to = "value") %>%
    separate(col = "name", into = c("dt", "rest"), sep = ":") %>%
    mutate(dt = as.numeric(sub("value", "", dt))) %>%
    select(-rest)
}

cfr = scale_by(signals, by = "value+0:jhu-csse_confirmed_7dav_incidence_num")

First we plot the CFR nationally. We restrict our attention to data after April 1, and to lags in increments of 5 for visibility.

ggplot(cfr %>% filter(geo_value == "us",
                      time_value >= "2020-04-15",
                      dt %in% seq(10, 30, by = 5)),
       aes(x = time_value, y = value)) + 
  geom_line(aes(color = as.factor(dt))) +
  geom_hline(yintercept = 1.5, linetype = 2) +
  scale_x_date(breaks = "months", date_labels = "%b") +
  labs(x = "Date", y = "Case fatality ratio (%)", 
       title = "Case fatality ratio, US", color = "Lag") 

We can see that the CFR has come down from the scary numbers we were seeing in the early stages of the pandemic and has been mostly flat since July, hovering somewhere above 1.5% (the horizontal dashed line). If we believe the shorter lags are relevant, then it looks like the CFR is possibly dropping in early November; but if we believe the longer lags are more relevant, then this really remains to be seen.

Next we zoom on on July 1 through current day.

ggplot(cfr %>% filter(geo_value == "us",
                      time_value >= "2020-07-01",
                      dt %in% seq(10, 30, by = 5)),
       aes(x = time_value, y = value)) + 
  geom_line(aes(color = as.factor(dt))) +
  geom_hline(yintercept = 1.5, linetype = 2) +
  scale_x_date(breaks = "months", date_labels = "%b") +
  labs(x = "Date", y = "Case fatality ratio (%)", 
       title = "Case fatality ratio, US", color = "Lag") 

CFR estimation: state level

We do the same as in the last part, but for each state individually.

ggplot(cfr %>% filter(geo_value != "us",
                      time_value >= "2020-04-15",
                      dt %in% seq(10, 30, by = 5)),
       aes(x = time_value, y = value)) + 
  geom_line(aes(color = as.factor(dt))) +
  geom_hline(yintercept = 1.5, linetype = 2) + 
  coord_cartesian(ylim = c(0, 8)) +
  labs(x = "", y = "", color = "Lag") +
  scale_x_date(breaks = "months", date_labels = "%b") +
  facet_wrap(~ geo_value, ncol = 3, scales = "free_x") +
  theme(legend.position = "top")

Forecasting: national errors

We consider forecasting \(k\)-day ahead national deaths based on current cases and two different mechanisms, with respect to the CFR:

  1. local: use most recently-available estimated CFR, for that lag value \(k\);
  2. global: use a constant CFR value throughout.

In aggregating errors, we restrict our attention on July 1 through current day.

# Join CFR along with cases and future deaths
cfr_signals = inner_join(
  inner_join(cfr %>% rename("cfr" = value),
             deaths %>% 
               select(time_value, geo_value, value) %>%
               rename("deaths" = value),
             by = c("geo_value", "time_value")),
  aggregate_signals(cases, dt = -(10:30), format = "long") %>%
    select(geo_value, time_value, dt, value) %>%
    mutate(dt = -dt) %>%
    rename("cases-dt" = value),
  by = c("geo_value", "time_value", "dt"))
    
# Consider the following global values for CFR
cfr_vals = c(1.4, 1.7, 2.1)

# Now make forecasts k days into the future, for each k in dt = 10:30, using:
# 1. the local CFR computed for that date and location, and
# 2. the global CFR values 
cfr_pred = cfr_signals %>% mutate(type = "cfr-local") %>%
  rbind(map_dfr(cfr_vals, function(val) {
    cfr_signals %>% mutate(cfr = val, type = paste("cfr-global:", val))
  })) %>%
  mutate(pred = cfr * `cases-dt` / 100, lower = NA, upper = NA)
    
plot_err = function(x, geo_value, start_day = NULL, end_day = NULL) {
  if (is.null(start_day)) start_day = Min(x$time_value)
  if (is.null(end_day)) end_day = Max(x$time_value[!is.na(x$deaths)])
  start_day = as.Date(start_day); end_day = as.Date(end_day)
  given_geo_value = geo_value
  
  ggplot(x %>% 
           filter(geo_value == given_geo_value,
                  between(time_value, start_day, end_day)) %>%
           group_by(dt, type) %>%
           summarize(error = Mean(abs(deaths - pred))) %>%
           filter(!is.na(error)),
         aes(x = dt, y = error)) +
    geom_point(aes(color = type)) + 
    geom_line(aes(color = type)) + 
    labs(x = "Lag", y = "Mean absolute error (deaths)", color = "",
         title = paste("Forecast errors,", toupper(geo_value)),
         subtitle = sprintf("from %s to %s", 
                            format.Date(start_day, "%B %d"), 
                            format.Date(end_day, "%B %d"))) 
}

plot_err(cfr_pred, geo_value = "us", start_day = "2020-07-01") 

We can see that, pretty much across the board (for any lag value), the global forecaster with CFR = 1.7% is the most accurate: easily more accurate than the other two global forecasters, and a little more accurate than the local one. It appears that the global CFR = 1.7% forecaster is most accurate at lag 16, that is, 16-day-ahead forecasts.

Next we repeat this but restricting to forecasts made after August 15.

plot_err(cfr_pred, geo_value = "us", start_day = "2020-08-15") 

We see that the CFR = 1.7% forecaster is still the most accurate, but the best lag drifts up to somewhere close to 20, contributing more evidence that there is not one “right” lag (certainly not a fixed answer throughout time). Also, the local CFR forecaster is more competitive.

Forecasting: national trajectories

For simplicity, we now restrict our attention to the global CFR = 1.7% model. We look at forecasts 16, 22, and 28 days ahead (these the lags we use for the CFR model). Below we plot the predictions along with actual deaths, and we include 95% bands based on quantiles of residuals. In all cases, the black line shows the observed deaths (recall, smoothed via a 7-day trailing average).

An important note: these bands have by construction 95% historical coverage. They have not been calibrated (nor been attempted to be calibrated) to have 95% future coverage. Therefore they should be looked at with skepticism when it comes to the actual forward-look predictions made by the CFR = 1.7% forecaster (indeed they look far too narrow here).

plot_pred = function(x, geo_value, dt, type, start_day = NULL, end_day = NULL) {
  if (is.null(start_day)) start_day = Min(x$time_value)
  if (is.null(end_day)) end_day = Max(x$time_value)
  start_day = as.Date(start_day); end_day = as.Date(end_day)
  given_geo_value = geo_value; given_dt = dt; given_type = type
  
  # Filter down to what we need
  x = x %>% 
    filter(geo_value == given_geo_value, 
           type %in% given_type, dt == given_dt,
           between(time_value, start_day, end_day)) 
  
  # Add quantiles, predictions (for anything else than ensemble model)
  x_list = split(x, x$type)
  for (ty in given_type) {
    if (ty != "ensemble") {
      x_list[[ty]] = make_pred(x_list[[ty]], geo_value, dt, ty)
    }
  }
  x = do.call(rbind, x_list)
  
  ggplot(x, aes(x = time_value)) +
    geom_ribbon(aes(ymin = lower, ymax = upper, fill = type),
                alpha = 0.2, show.legend = FALSE) +
    geom_line(aes(y = pred, color = type)) +
    geom_line(aes(y = deaths), color = "black") +
    #scale_color_manual(values = colors) +
    labs(x = "Date", y = "Deaths", color = "", fill = "",
         title = sprintf("%s-day-ahead forecast, %s", given_dt, 
                         toupper(geo_value)))
}

make_pred = function(x, given_geo_value, given_dt, given_type) {
  # Add forward-looking predictions 
  given_cfr = x %>% filter(time_value == max(time_value)) %>% pull(cfr)
  x = x %>% bind_rows(
    cases %>% filter(geo_value == given_geo_value,
                     time_value >= max(time_value) - given_dt) %>%
      select(time_value, geo_value, value) %>%
      mutate(dt = given_dt, cfr = given_cfr) %>%
      rename("cases-dt" = value) %>% 
      mutate(pred = cfr * `cases-dt` / 100, 
             time_value = time_value + dt, 
             type = given_type))

  # Add quantiles, and then return
  x %>% 
    mutate(lower = pred + quantile(deaths - pred, prob = 0.025, na.rm = TRUE), 
           upper = pred + quantile(deaths - pred, prob = 0.975, na.rm = TRUE))
}

plot_pred(cfr_pred, geo_value = "us", dt = 16, type = "cfr-global: 1.7", 
          start_day = "2020-07-01")

plot_pred(cfr_pred, geo_value = "us", dt = 22, type = "cfr-global: 1.7", 
          start_day = "2020-07-01")

plot_pred(cfr_pred, geo_value = "us", dt = 28, type = "cfr-global: 1.7",
          start_day = "2020-07-01")

Ensemble comparison

Lastly, how accurate are these CFR-based forecasts compared to those from the COVID Forecast Hub, which serves as the official data source for the CDC’s communications on COVID-19 forecasting? To answer this, we fetch the COVID Hub ensemble’s forecasts (which has shown, consistently, to be typically more robust and accurate that any the individual component forecasters).

library(covidHubUtils) # https://github.com/reichlab/covidHubUtils

quiet = function(x) { 
  sink(tempfile()) 
  on.exit(sink()) 
  invisible(force(x)) 
} 

ens = quiet(load_forecasts(models = "COVIDhub-ensemble",
                           types = c("point", "quantile"), 
                           location = "US",
                           targets = c("1 wk ahead inc death", 
                                       "2 wk ahead inc death", 
                                       "3 wk ahead inc death",
                                       "4 wk ahead inc death")))

# Wrangle to match the format of CFR predictions, join together
cfr_ens_pred = cfr_pred %>%
  bind_rows(
    ens %>% 
      select(-c(model, forecast_date, temporal_resolution, target_variable, 
                type)) %>%
      filter(is.na(quantile) | quantile %in% c(0.025, 0.975)) %>%
      pivot_wider(names_from = "quantile", values_from = "value") %>%
      rename("geo_value" = location, "time_value" = target_end_date, 
             "dt" = horizon, "pred" = `NA`,
             "lower" = `0.025`, "upper" = `0.975`) %>%
      mutate(type = "ensemble", geo_value = tolower(geo_value), 
             dt = as.numeric(dt) * 7,  pred = pred / 7,
             lower = lower / 7, upper = upper / 7) %>%
      left_join(deaths %>%
                  rename("deaths" = value) %>%
                  select(time_value, geo_value, deaths), 
                by = c("time_value", "geo_value")))

Here is the comparison of forecast errors.

plot_err(cfr_ens_pred, geo_value = "us", start_day = "2020-07-01") 

plot_err(cfr_ens_pred, geo_value = "us", start_day = "2020-08-15") 

Interestingly this reveals that the CFR = 1.7% forecaster is quite competitive with the COVID Hub ensemble, over both periods (after July 1, and after August 15), though only for longer lags (farther-ahead forecasts).

However, a critical point: this is not a perfectly fair comparison! We only considered the global CFR = 1.7% forecaster in the first place because it looked interesting in hindsight. In other words, based on all data available as of today, we are asking how a CFR of 1.7% does in terms of forecast accuracy retrospectively. The ensemble forecaster is, of course, prospective (it is a true forecast) and does not enjoy the benefit of hindsight.

Here is now the comparison of forecast trajectories: moving to 14, 21, and 28 days ahead (multiples of 7) because that is when the ensemble is available.

plot_pred(cfr_ens_pred, geo_value = "us", dt = 14, 
          type = c("cfr-global: 1.7", "ensemble"), start_day = "2020-07-01")

plot_pred(cfr_ens_pred, geo_value = "us", dt = 21, 
          type = c("cfr-global: 1.7", "ensemble"), start_day = "2020-07-01")

plot_pred(cfr_ens_pred, geo_value = "us", dt = 28, 
          type = c("cfr-global: 1.7", "ensemble"), start_day = "2020-07-01")

In all cases (all days ahead), the CFR-based forecaster is right at or right above the upper endpoint of the ensemble’s prediction band. Though we can also see that the ensemble has been tending to underpredict in the recent weeks, particularly for 21 and 28 days ahead.